Harden host-side sparse index width and allocation safety#2822
Draft
LwhJesse wants to merge 2 commits into
Draft
Harden host-side sparse index width and allocation safety#2822LwhJesse wants to merge 2 commits into
LwhJesse wants to merge 2 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Harden Host-Side Sparse Index Width and Allocation Safety
Summary
This PR hardens the host-side sparse linear algebra path against integer-width overflow and silent narrowing.
It introduces an explicit 64-bit host sparse-index alias,
and uses it in the host sparse-pattern / CSR metadata path where index width and allocation size matter. It also adds checked multiplication for the main storage-size calculations and checked narrowing at the remaining external boundaries.
This PR does not modify the CUDA backend implementation itself. In particular, it does not change:
Common/src/linear_algebra/CSysMatrixGPU.cuCommon/include/linear_algebra/GPUComms.cuhCUDA matvec correctness remains the responsibility of PR #2816.
Problem
Several sparse linear algebra paths in SU2 currently rely on
unsigned longorintfor sparse metadata, storage-size arithmetic, or boundary conversion.On LP64 systems this often goes unnoticed because
unsigned longis usually 64-bit. On LLP64 systems such as 64-bit Windows,unsigned longis still 32-bit. At sufficiently large local sparse-matrix sizes, that can lead to:nnz * nVar * nEqnFor the main matrix storage, the critical host-side product is:
If that is evaluated in 32-bit arithmetic, the largest representable unsigned value is:
So the first overflowing matrix-scalar count is:
For
double, that means:which is:
32 GiBbinary34.36 GBdecimalThat is the origin of the host-side estimate quoted above.
To connect that number to an actual sparse matrix, take a common square-block case with
nVar = nEqn = 5. Then each nonzero block contributes25matrix scalars, so the first overflowing host-side case is:This is already larger than
2^32 - 1 = 4,294,967,295, so the host-side 32-bit product has crossed the limit. In memory terms, that same case corresponds to:which is again about:
32 GiBbinary34.36 GBdecimalSo the quoted
34.36 GBnumber is not a separate back-of-the-envelope estimate; it is simply the byte size of the first5 x 5double-precision sparse matrix whosennz_blocks * nVar * nEqnproduct no longer fits in 32-bit unsigned arithmetic.Using a rough relation
nnz_blocks ~= N_local * z, withzthe average nonzero-block count per row, this5 x 5threshold corresponds approximately to:17.18 millionlocal points atz = 108.59 millionlocal points atz = 20The same argument gives the corresponding first-overflow cases for larger square blocks:
6 x 6:119,304,648 * 36 = 4,294,967,3287 x 7:87,652,394 * 49 = 4,294,967,306So the threshold is high, but it is still a local sparse-matrix scale that can be reached in large implicit runs.
For reference, the current legacy CUDA kernel in
developcan overflow earlier because it uses a signed 32-bitintmatrix-scalar offset internally. The largest positive signed 32-bit value is:so the first overflowing matrix-scalar count is:
For
double, that means:which is:
16 GiBbinary17.18 GBdecimalThat estimate is specific to the legacy custom CUDA matvec path currently used in
develop.PR #2816 naturally addresses that CUDA-side issue by replacing the old custom matvec kernel, including the legacy signed-
intmatrix-offset path that creates this earlier overflow limit. That is why this PR does not modify the CUDA backend itself.What this PR changes
su2_index_t = std::uint64_tfor the host sparse-index chain that needs it.CSysMatrixsparse metadata,CSysVectorsize/index API,CPastixWrappersparse input handling, and the sparse-pattern-facing geometry accessors required for that propagation.nnz * nVar * nEqnnnz_ilu * nVar * nEqnnPointDomain * nVar * nEqnnumBlk * numVarnumBlkDomain * numVarunsigned long, but range-checks host sparse indices before converting them for the current CUDA upload path.After this change, the targeted host-side sparse-index and allocation path no longer fails by silent wraparound or silent truncation. If a value exceeds the remaining boundary types, the code now fails explicitly through
SU2_MPI::Error(...).Validation
git diff --checkpassed.For numerical validation, I ran the existing six-case correctness harness with a two-way CPU comparison:
develop CPUthis branch CPUCases:
periodic2d_sectorudf_lam_flatplate_sudf_lam_flatplate_mudf_lam_flatplate_ludf_test_11_probes_sudf_test_11_probes_mResult:
this branch CPUmatcheddevelop CPUin all 6 caseshistory.csvmatched exactly in all 6 casesmax_abs_delta = 0.0for every caseFor CUDA, the relevant correctness discussion remains PR #2816, since this branch still uses the legacy pre-#2816 custom CUDA matvec implementation and this PR intentionally does not modify that backend. In particular, the current
develop-side signed-intoffset limit described above is handled naturally by the#2816backend replacement rather than by this PR.Notes on testing
This PR does not add a direct large-memory reproducer for the original host-side overflow. The original failure mode depends on LLP64-style 32-bit
unsigned longarithmetic, and a faithful end-to-end reproduction would require a much larger local sparse structure than is practical for a routine test here.PR Checklist
--warnlevel=3when using meson).pre-commit run --allto format old commits.